Out-of-Set i-Vector Selection for Open-set Language Identification

Authors

  • Hamid Behravan
  • Tomi Kinnunen
  • Ville Hautamäki
Abstract

Current language identification (LID) systems are based on an i-vector classifier followed by a multi-class recognition back-end. Identification accuracy degrades considerably when LID systems face open-set data. In this study, we propose an approach to out-of-set (OOS) data detection in the context of open-set language identification. In our approach, each unlabeled i-vector in the development set is given a per-class outlier score computed with the non-parametric Kolmogorov-Smirnov (KS) test. The OOS data detected in the unlabeled development set is then used to train an additional model that represents OOS languages in the back-end. In discriminating in-set from OOS languages, the proposed approach achieves a 16% relative decrease in equal error rate (EER) over classical OOS detection methods. With a support vector machine (SVM) as the language back-end classifier, integrating the proposed method into the LID back-end yields a 15% relative decrease in identification cost compared with using the entire development set as OOS candidates.
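
The per-class outlier scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not say which quantities the KS test compares, so everything below is an assumption made for illustration: for each in-set class, the distances from an unlabeled i-vector to that class's i-vectors are compared with the class's own pairwise distances using the two-sample KS statistic, and the per-class scores are combined by taking the minimum. The function names (`ks_statistic`, `per_class_outlier_scores`, `oos_score`, `toy_class`), the Euclidean distance, and the min rule are hypothetical choices, not the authors' method.

```python
import math
import random
from bisect import bisect_right


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def per_class_outlier_scores(x, classes):
    """Return {label: KS score in [0, 1]} for one unlabeled vector x.

    classes maps a language label to a list of i-vectors (at least two
    per class). Assumed scoring: compare distances from x to the class
    with the pairwise distances inside the class.
    """
    scores = {}
    for label, vectors in classes.items():
        within = [euclidean(u, v)
                  for i, u in enumerate(vectors)
                  for v in vectors[i + 1:]]
        to_class = [euclidean(x, v) for v in vectors]
        scores[label] = ks_statistic(to_class, within)
    return scores


def oos_score(x, classes):
    """Assumed combination rule: x looks out-of-set only if it is an
    outlier for every in-set class, so take the minimum per-class score."""
    return min(per_class_outlier_scores(x, classes).values())


def toy_class(rng, center, n=30, dim=5):
    """Hypothetical toy data: n Gaussian vectors around a common center."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]
```

Under these assumptions, a vector far from every in-set class gets a score near 1, while a vector drawn from one of the classes gets a lower score. Thresholding `oos_score` would then select the OOS candidates used to train the additional back-end model.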


Similar resources

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

I2R Submission to the 2015 NIST Language Recognition i-Vector Challenge

This paper presents a detailed description and analysis of the I2R submission, which is among the top-performing systems, to the 2015 NIST language recognition i-vector machine learning challenge. Our submission is a fusion of several sub-systems based on linear discriminant analysis (LDA), support vector machine (SVM), multi-layer perceptron (MLP), deep neural network (DNN), and multi-class logisti...

FUZZY BOUNDED SETS AND TOTALLY FUZZY BOUNDED SETS IN I-TOPOLOGICAL VECTOR SPACES

In this paper, a new definition of fuzzy bounded sets and totally fuzzy bounded sets is introduced and properties of such sets are studied. Then a relation between totally fuzzy bounded sets and N-compactness is discussed. Finally, a geometric characterization for fuzzy totally bounded sets in I-topological vector spaces is derived.

Out of Set Language Modelling in Hierarchical Language Identification

This paper proposes a novel approach to the open-set language identification task by introducing out-of-set (OOS) language modelling in a Hierarchical Language Identification (HLID) framework. Most recent language identification systems use data from languages other than the target languages to model OOS languages. The proposed approach does not require such data to model OOS languages, inste...

Open-Set Language Identification

We present the first open-set language identification experiments using one-class classification models. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus f...

Publication date: 2016